This notebook shows the math behind a very fast NEON implementation of biquad IIR filters. One of the main primitives NEON is very highly optimized for is multiplying a matrix by a vector, especially of size 4.
A biquad filter has two state variables (this is most apparent in transposed direct form II). For this implementation, the input vector consists of two input values and the two state variables. The output vector is two output values and the new state vector. The relationship is linear, ie the transformation from input to output vector is simply a matrix-vector multiplication. Each iteration computes two samples, using a single 4x4 matrix-vector multiply.
In [117]:
%pylab inline
We start with a simple implementation of biquad for reference; we'll compare the error of the matrix approach against this scalar implementation. This is in direct form I.
In [118]:
def biquad(coefs, inp):
b0, b1, b2, a1, a2 = coefs
out = zeros(len(inp))
x1 = 0
x2 = 0
y1 = 0
y2 = 0
for i in range(len(inp)):
x = inp[i]
y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
out[i] = y
x2, x1 = x1, x
y2, y1 = y1, y
return out
As a starting step, here's the same biquad in a 2x2 matrix formulation. This version is derived from transposed direct form II, with a little expansion of the effect this input sample has on the next state vector (this is the B array below). The A matrix describes the evolution of the IIR state.
In [119]:
def bqmat(coefs, inp):
b0, b1, b2, a1, a2 = coefs
A = array([[-a1, 1], [-a2, 0]])
B = array([b1 - a1 * b0, b2 - a2 * b0])
y = array([0, 0])
out = np.zeros(len(inp))
for i in range(len(inp)):
x = inp[i]
out[i] = y[0] + x * b0
y = dot(A, y) + B * x
return out
Let's test it out.
In [120]:
impulse = zeros(100)
impulse[10] = 1
coefs = 1.0207, -1.7719, .9376, -1.7719, 0.9583 # peaking
plot(biquad(coefs, impulse))
plot(bqmat(coefs, impulse))
Out[120]:
In [121]:
plot(biquad(coefs, impulse) - bqmat(coefs, impulse))
Out[121]:
Note the scale of the error -- it's floating point rounding.
Now for the 4x4 matrix approach. It's really four 2x2 matrices tiled. The top left corner is the influence of the two input samples on the two output samples; it's the first two values of the biquad's impulse response, banded. The top right corner is the contribution of the state vector to the output. The bottom left is the effect of the input on the state, and the bottom right is the evolution of the IIR. Since each iteration is two samples, this matrix is simply the square of the A matrix above.
Note that each iteration processes two samples. The matrix-vector product is 16 multiply-accumulates, so it amortizes to 8 per sample. This compares well to the 5 multiply-accumulates of the direct form, but organized for extremely fast evaluation in SIMD, and NEON in particular.
In [122]:
def biquadm(coefs, inp):
b0, b1, b2, a1, a2 = coefs
c1 = b1 - a1 * b0
c2 = b2 - a2 * b0
A = array([[b0, 0, 1, 0],
[c1, b0, -a1, 1],
[-a1 * c1 + c2, c1, -a2 + a1*a1, -a1],
[-a2 * c1, c2, a1*a2, -a2]])
out = np.zeros(len(inp))
y = zeros(4)
for i in range(0, len(inp), 2):
y[0:2] = inp[i:i+2]
y = dot(A, y)
out[i:i+2] = y[0:2]
return out
Again, let's test it out.
In [123]:
plot(biquad(coefs, impulse))
plot(biquadm(coefs, impulse))
Out[123]:
The error is similar to that of the 2x2 matrix version, but not quite the same.
In [124]:
plot(biquadm(coefs, delta) - biquad(coefs, delta))
plot(biquadm(coefs, delta) - bqmat(coefs, delta))
Out[124]:
In [124]: